Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Run GC in file leak check #48294

Merged
merged 2 commits into from
Aug 31, 2022
Merged

Run GC in file leak check #48294

merged 2 commits into from
Aug 31, 2022

Conversation

jonashaag
Copy link
Contributor

This fixes a problem with file_leak_context (xref #30096 (comment)).

Demo:

import gc
import pandas as pd
import pandas.util._test_decorators as td

with td.file_leak_context():
    reader = pd.read_csv("t.csv", iterator=True)
    reader.cycle = reader
    del reader
    # gc.collect()

Here we have a reference cycle reader -> .cycle -> reader that is only cleared when GC is run. On my machine, without the gc.collect() statement, file_leak_context detects a file leak where there is none.

The PR fixes this by explicitly calling gc.collect() before collecting the leaked files. I also changed an equality to a subset check to make the code robust against additional closed files (due to the gc.collect() call, or other reasons that I don't have any examples for).

cc @jbrockmendel

  • closes #xxxx (Replace xxxx with the Github issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

@jbrockmendel
Copy link
Member

does this merit a test?

@jonashaag
Copy link
Contributor Author

Happy to add one, but I think it is difficult to come up with one that is both not too contrived and guaranteed to work (the one above scores good on the former but bad on the latter)

@mroeschke mroeschke added Testing pandas testing functions or related to the test suite IO Data IO issues that don't fit into a more specific label labels Aug 29, 2022
@WillAyd
Copy link
Member

WillAyd commented Aug 30, 2022

I'm not sure I conceptually agree with explicitly running the garbage collector. Is this not something we should just handle via weak references?

@jonashaag
Copy link
Contributor Author

That would require all references that might indirectly reference any open files to be weak references. Actually I think it’s likely not even the standard library guarantees this. The standard idiom is to use with to make sure everything is closed immediately when you’re done but that’s not possible with Pandas’ idiom.

@WillAyd
Copy link
Member

WillAyd commented Aug 30, 2022

Ah sorry didn't notice the full context to this. Your explanation makes sense, though this seems symptomatic of other problems in the code base that ideally should be addressed

@mroeschke mroeschke added this to the 1.6 milestone Aug 30, 2022
@mroeschke mroeschke merged commit e7414aa into pandas-dev:main Aug 31, 2022
@mroeschke
Copy link
Member

Thanks @jonashaag

@mroeschke mroeschke modified the milestones: 1.6, 2.0 Oct 13, 2022
noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Data IO issues that don't fit into a more specific label Testing pandas testing functions or related to the test suite
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants